Goto

Collaborating Authors

 global minimizer


From Saddle Points Toward Global Minima: A Newton-Type Method on Wasserstein Space

arXiv.org Machine Learning

We study the minimization of non-convex functionals over the Wasserstein space. While recent work has showed that perturbed Wasserstein gradient methods can avoid saddle points for benign landscapes, existing approaches remain essentially first-order and do not provide fast local convergence once the iterates enter a neighborhood of a global minimizer. We propose Wasserstein Saddle-Free Newton (WSFN), a second-order method that preconditions the Wasserstein gradient by a regularized square root of the squared Wasserstein Hessian. This construction preserves attraction toward directions of positive curvature while inducing repulsion along directions of negative curvature, thereby overcoming the tendency of standard Wasserstein Newton dynamics to be attracted to saddles. We also establish second-order sufficient optimality conditions on Wasserstein space for strict local minimality. Under regularity and benign landscape assumptions, we prove that WSFN escapes saddle regions and reaches an $ฮฑ$-neighborhood of a global minimizer in polynomial time, with improved dependence on saddle parameters compared with prior perturbed first-order methods. Once inside this neighborhood, we show that WSFN converges linearly in $L^2$-Wasserstein distance to a non-degenerate global minimizer. Finally, we present a particle-based implementation of the method.


statements and

Neural Information Processing Systems

Let a two-player Markov game where both players affect the transition. We will effectively show that the problem of best-responding to a correlated policy ฯƒ is526 equivalent to best-responding to the marginal policy of ฯƒ for the opponent. The proof follows from527 the equivalence of the two MDPs.528 Before that, given a (possibly correlated) joint policy ฯƒ we define a nonlinear program, (PBR), whose539 optimal solutions are best-response policies of each agent k to ฯƒ k and the values for each state s540 and timestep h:541 A.2 Proof of Theorem 3.2542 The best-response program. First, we state the following lemma that will prove useful for several543 of our arguments,544 Lemma A.1 (Best-response LP).


When Expressivity Meets Trainability: Fewer than n Neurons Can Work

Neural Information Processing Systems

Modern neural networks are often quite wide, causing large memory and computation costs. It is thus of great interest to train a narrower network. However, training narrow neural nets remains a challenging task. We ask two theoretical questions: Can narrow networks have as strong expressivity as wide ones? If so, does the loss function exhibit a benign optimization landscape?


Sparse Network Inference under Imperfect Detection and its Application to Ecological Networks

arXiv.org Machine Learning

Abstract--Recovering latent structure from count data has received considerable attention in network inference, particularly when one seeks both cross-group interactions and within-group similarity patterns in bipartite networks, which is widely used in ecology research. Such networks are often sparse and inherently imperfect in their detection. Existing models mainly focus on interaction recovery, while the induced similarity graphs are much less studied. Moreover, sparsity is often not controlled, and scale is unbalanced, leading to oversparse or poorly rescaled estimates with degrading structural recovery. We impose nonconvex โ„“1/2 regularization on the latent similarity and connectivity structures to promote sparsity within-group similarity and cross-group connectivity with better relative scale. To solve it, we develop an ADMM-based algorithm with adaptive penalization and scale-aware initialization and establish its asymptotic feasibility and KKT stationarity of cluster points under mild regularity conditions. Experiments on synthetic and real-world ecological datasets demonstrate improved recovery of latent factors and similarity/connectivity structure relative to existing baselines. Index Terms--augmented Lagrangian, nonconvex nonsmooth optimization, nonnegative matrix factorization, link prediction, ecological network inference, structured sparse recovery I. INTRODUCTION This setting is inherent in sensing and monitoring applications [3], [4], where observations, such as counts, are obtained via an imperfect sampling process. In this paper, we are interested in ecological interaction networks describing how species associate with locations and how environments shape biodiversity patterns [5], [6].


Phase transitions in Doi-Onsager, Noisy Transformer, and other multimodal models

arXiv.org Machine Learning

We study phase transitions for repulsive-attractive mean-field free energies on the circle. For a $\frac{1}{n+1}$-periodic interaction whose Fourier coefficients satisfy a certain decay condition, we prove that the critical coupling strength $K_c$ coincides with the linear stability threshold $K_\#$ of the uniform distribution and that the phase transition is continuous, in the sense that the uniform distribution is the unique global minimizer at criticality. The proof is based on a sharp coercivity estimate for the free energy obtained from the constrained Lebedev--Milin inequality. We apply this result to three motivating models for which the exact value of the phase transition and its (dis)continuity in terms of the model parameters was not fully known. For the two-dimensional Doi--Onsager model $W(ฮธ)=-|\sin(2ฯ€ฮธ)|$, we prove that the phase transition is continuous at $K_c=K_\#=3ฯ€/4$. For the noisy transformer model $W_ฮฒ(ฮธ)=(e^{ฮฒ\cos(2ฯ€ฮธ)}-1)/ฮฒ$, we identify the sharp threshold $ฮฒ_*$ such that $K_c(ฮฒ) = K_\#(ฮฒ)$ and the phase transition is continuous for $ฮฒ\leq ฮฒ_*$, while $K_c(ฮฒ) ฮฒ_*$. We also obtain the corresponding sharp dichotomy for the noisy Hegselmann--Krause model $W_{R}(ฮธ) = (R-2ฯ€|ฮธ|)_{+}^2$ .


Markov Equivalence and Consistency in Differentiable Structure Learning

Neural Information Processing Systems

Existing approaches to differentiable structure learning of directed acyclic graphs (DAGs) rely on strong identifiability assumptions in order to guarantee that global minimizers of the acyclicity-constrained optimization problem identifies the true DAG. Moreover, it has been observed empirically that the optimizer may exploit undesirable artifacts in the loss function. We explain and remedy these issues by studying the behavior of differentiable acyclicity-constrained programs under general likelihoods with multiple global minimizers. By carefully regularizing the likelihood, it is possible to identify the sparsest model in the Markov equivalence class, even in the absence of an identifiable parametrization. We first study the Gaussian case in detail, showing how proper regularization of the likelihood defines a score that identifies the sparsest model. Assuming faithfulness, it also recovers the Markov equivalence class. These results are then generalized to general models and likelihoods, where the same claims hold. These theoretical results are validated empirically, showing how this can be done using standard gradient-based optimizers (without resorting to approximations such as Gumbel-Softmax), thus paving the way for differentiable structure learning under general models and losses.


A Missing statements and proofs 521 A.1 Statements for Section 3.1

Neural Information Processing Systems

Let a two-player Markov game where both players affect the transition. As we have seen in Section 2.1, in the case of unilateral deviation from joint policy Let a (possibly correlated) joint policy ห† ฯƒ . By Lemma A.1, we know that Where the equality holds due to the zero-sum property, (1). An approximate NE is an approximate global minimum. An approximate global minimum is an approximate NE.


Transformers learn to implement preconditioned gradient descent for in-context learning

Neural Information Processing Systems

Several recent works demonstrate that transformers can implement algorithms like gradient descent. By a careful construction of weights, these works show that multiple layers of transformers are expressive enough to simulate iterations of gradient descent.